Clustering as a measure of the local topology of networks 
Usual formulations of the clustering coefficient can be shown to be insufficient in the task of
describing the local topology of very simple networks. Motivated by this, we review some alternatives 
in order to present an extension, the clustering profile. We show, both conceptually and through 
applications to well studied networks, that this measure is a more complete and robust measure of 
clustering. It imposes stringent constraints on theoretical growth models, especially on aspects of
the network structure that play a central role in dynamics on networks. In addition, we study how 
it provides a richer perspective of phenomena such as hierarchy, small-worlds and clusterization. 
I. INTRODUCTION

Physicists have been greatly interested in studying the 
interplay between the topology and growth dynamics of 
complex networks, for they are a universal framework 
to understand various processes in previously distant 
areas ranging from cell metabolism to linguistics.
The unusual nature of their structure, represented by
graphs, required the development of novel methods,
and one of the core difficulties has been to understand 
what measurements are robust and indeed represent uni- 
versal quantities by which we can compare and clas- 
sify networks, or assert how effectively a network growth 
model reflects the network it tries to mimic. 

Average distance, degree distribution and degree
correlations, loopiness, motifs and the clustering
coefficient are some quantities which have established
themselves as useful. But, among them, only the lat- 
ter gives any description of the local topology — how 
the network is organized close to some vertex — measur- 
ing how many of its neighbors are also neighbors among 
themselves. It has also been established that local struc- 
ture is not only an important topological characteristic, 
but also a main concern when studying searchability
and dynamics [10, 11] on networks.

Still, important as it stands, the usual clustering coef- 
ficient has serious limitations, and is commonly replaced 
by ad hoc definitions. We first focus on these shortcomings,
as they will let us understand why so many variations
emerged, and, by summarizing these variations, propose a
consistent and robust improvement, which should provide
a more complete and less specialized characterization
of the local topology of networks.

We remark that our focus is not on clustering as 
transitivity [12], but as a general measure of small-scale
structural organization. 
II. BACKGROUND

When first introduced to formalize the small-world
phenomenon, the clustering coefficient was defined as the
fraction of connected pairs among the neighbors of a ver- 
tex, averaged over all vertices. 

Following that, it was also defined by taking the fraction
of connected pairs over all pairs of neighbors of all
vertices [13], since that is frequently more tractable. In
this form, it was used to derive analytical expressions
that distinguish social networks from other networks [14].

A first problem with the coefficient is that these 
two seemingly close definitions may yield very different 
results [15], as can be seen from the fact that the individual
ratio for vertices of high degree has greater impact
in the second case.
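The gap between the two definitions can be checked on a toy graph. The sketch below is only an illustration: the graph (a hub that closes one triangle and also carries four leaves) and all function names are our own assumptions, and vertices with fewer than two neighbors are counted as zero in the vertex average, which is one convention among several.

```python
from itertools import combinations

def build_graph(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def neighbor_pairs(adj, u):
    """Return (connected pairs, total pairs) among the neighbors of u."""
    pairs = list(combinations(sorted(adj[u]), 2))
    linked = sum(1 for v, w in pairs if w in adj[v])
    return linked, len(pairs)

def clustering_vertex_average(adj):
    """First definition: per-vertex fraction, averaged over all vertices.
    Assumption: vertices with fewer than two neighbors contribute 0."""
    total = 0.0
    for u in adj:
        linked, pairs = neighbor_pairs(adj, u)
        total += linked / pairs if pairs else 0.0
    return total / len(adj)

def clustering_pair_ratio(adj):
    """Second definition: connected pairs over all pairs, pooled over vertices."""
    linked = pairs = 0
    for u in adj:
        l, p = neighbor_pairs(adj, u)
        linked += l
        pairs += p
    return linked / pairs if pairs else 0.0

# Hub 0 closes the triangle (0, 1, 2) and also has four leaves.
toy = build_graph([(0, 1), (0, 2), (1, 2), (0, 3), (0, 4), (0, 5), (0, 6)])
print(clustering_vertex_average(toy))  # 31/105, about 0.30
print(clustering_pair_ratio(toy))      # 3/17, about 0.18
```

The hub contributes 15 of the 17 neighbor pairs, so its low individual ratio dominates the pooled definition while counting as a single vertex in the average.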

Later, a better description of clustering was introduced 
by considering the coefficient as a function of vertex de- 
gree, and doing so revealed important structural features 
such as hierarchical organization [16].

Another interesting consequence is that this also removes
the ambiguity of the first two definitions: if averages
are restricted to a set of vertices of the same degree,
averaging the coefficient over vertices and calculating the
fraction of connections over all pairs of neighbors give the
same result.
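The equivalence at fixed degree follows in one line. In the sketch below, $V_k$ (the set of vertices of degree $k$) and $t_u$ (the number of connected pairs among the neighbors of $u$) are notation introduced here for illustration only.

```latex
% Every vertex of degree k has the same number of neighbor pairs, \binom{k}{2},
% so the common denominator factors out of the average.
\frac{1}{|V_k|} \sum_{u \in V_k} \frac{t_u}{\binom{k}{2}}
  = \frac{\sum_{u \in V_k} t_u}{|V_k| \binom{k}{2}}
  = \frac{\sum_{u \in V_k} t_u}{\sum_{u \in V_k} \binom{k}{2}}
```

The left-hand side is the vertex average and the right-hand side is the pooled fraction over all pairs of neighbors, so the two definitions coincide once the degree is fixed.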

Still, other definitions have become necessary, as in the 
case of bipartite networks used in the study of sexually 
transmitted diseases [17], where no odd cycles exist and
thus, because connected neighbors of a vertex would form 
cycles of length 3, the clustering coefficient is always zero, 
even though a bipartite network may have a very complex 
local structure. 

The underlying issue becomes more evident when not- 
ing that both a square lattice and a single large cycle, 
which are locally very different networks, have their clus- 
tering coefficient equal to zero. 

We take this as evidence that a general treatment of
clustering cannot expect to rest on a simple scalar quantity;
however, it should remain a local property that can
be calculated for each vertex and compared among
them. Thus the source of the problem seems to be the
question asked, "how many of my neighbors are connected?",
instead of "how closely related are my neighbors?",
which we think better captures the concept
of clustering as a measure of local topology.

FIG. 1: Higher orders of clustering, $C^d(v)$: the fraction of
pairs of neighbors whose smallest cycle shared with the vertex
has length d + 2. In the example above, v has 10 pairs of neighbors
and a non-zero clustering for d = 1, 2. $C^1(v)$ is, by definition,
the usual clustering coefficient.

In order to implement this idea we turn back to the 
original definition of clustering for a vertex, the fraction 
of connected pairs among its neighbors, and notice that 
"connected" can stand for "whose distance is 1". Then 
the usual clustering coefficient can be understood as the 
first term of a sequence, the term accounting for the frac- 
tion of pairs of neighbors whose distance is 1. The second 
term would be the fraction of pairs of neighbors whose 
distance is 2, and so on. Now, since those are all neigh- 
bors of the same vertex, they are always connected by 
a path of length 2 going through that vertex, and so we 
must discard paths going through the vertex in question 
when calculating this distance between its neighbors. 

These higher order clustering coefficients were first
explored by A. Fronczak et al. [18] in order to show,
in a stronger sense, that the model of preferential
attachment [19] is blind to clustering mechanisms.

The meaning of this extended clustering can, perhaps, 
be more clearly understood in terms of cycles: the nth 
term of the sequence would correspond to the fraction of 
pairs of neighbors of a vertex whose smallest cycle shared 
with it has length n + 2, as can be seen in FIG. 1.

So, by combining these dependencies on degree and 
distance, we obtain a measure that is neither ambiguous 
nor specialized, and which not only applies to all networks,
but is a richer description of their local topology. For 
practical purposes, instead of referring to higher order 
clustering coefficients averaged as a function of degree, 
we name this quantity the clustering profile, and proceed 
to a formal definition. 



III. CLUSTERING PROFILE 

In a network G composed of vertices V and edges E, so
that G = (V, E), we denote the clustering for a vertex $u \in V$
as $C^d(u)$, defined as the number of pairs of neighbors
of u whose distance in the induced network $G(V \setminus \{u\})$
is d, divided by the total number of pairs of neighbors of
u. Thus

$$
C^d(u) = \frac{\left|\left\{\{v, w\} : v, w \in N(u),\ d_{G(V \setminus \{u\})}(v, w) = d\right\}\right|}{\binom{|N(u)|}{2}}, \qquad (1)
$$

where $N(\cdot)$ is the set of neighbors of a vertex, the modulus
$|\cdot|$ represents the cardinality (number of elements)
of a set, and so $|N(u)|$ is the degree of u, also denoted
deg(u).
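Eq. (1) can be evaluated with one breadth-first search per neighbor, run on the graph with u deleted. The following is a minimal pure-Python sketch under our own assumptions (adjacency stored as a dict of sets, hypothetical function names, pairs left disconnected by the removal of u simply not counted in any order); it is not the implementation used for the results in this paper.

```python
from collections import deque

def distances_without(adj, source, removed):
    """BFS distances from source in the graph with vertex `removed` deleted."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y != removed and y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist

def higher_order_clustering(adj, u):
    """Return {d: C^d(u)} as in Eq. (1)."""
    nbrs = list(adj[u])
    n_pairs = len(nbrs) * (len(nbrs) - 1) // 2
    counts = {}
    for i, v in enumerate(nbrs):
        dist = distances_without(adj, v, u)
        for w in nbrs[i + 1:]:
            if w in dist:  # pairs separated by removing u are skipped
                counts[dist[w]] = counts.get(dist[w], 0) + 1
    return {d: c / n_pairs for d, c in sorted(counts.items())}

def cycle_graph(n):
    """A single cycle on n vertices."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}

def torus_lattice(side):
    """Square lattice with periodic boundaries; vertices are (row, col)."""
    return {
        (r, c): {((r + 1) % side, c), ((r - 1) % side, c),
                 (r, (c + 1) % side), (r, (c - 1) % side)}
        for r in range(side) for c in range(side)
    }

cycle_result = higher_order_clustering(cycle_graph(12), 0)
lattice_result = higher_order_clustering(torus_lattice(8), (0, 0))
print(cycle_result)    # all weight at d = 10
print(lattice_result)  # weight 2/3 at d = 2 and 1/3 at d = 4
```

Both graphs have a usual clustering coefficient of zero, yet the higher orders separate them: the lattice concentrates its pairs at short distances, while the cycle places its single pair at a distance that grows with the size of the cycle.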

This leads to a generalized description of how the net- 
work is organized around that vertex, reflecting the con- 
tribution of more distant neighbors in higher terms, while 
still preserving the good property that, when summed 
over all terms, it ranges between 0 and 1.

We can then define the clustering profile for a network
as the average of $C^d(u)$ over all vertices of the same degree
k, and denote it

$$
C^d_k = \frac{1}{|\{u : \deg(u) = k\}|} \sum_{\{u : \deg(u) = k\}} C^d(u). \qquad (2)
$$

It should be noticed that the usual clustering coefficient
as a function of degree is simply $C^1_k$.

Although numerically calculating the clustering
profile is a more expensive operation than calculating
the usual coefficient, each step of the calculation
is parallelizable, so even large networks can be treated
with relatively small computer resources.
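As an illustration of the definition, the clustering profile can be assembled by grouping the per-vertex values by degree. This is a sketch under our own assumptions (pure Python, hypothetical function names, a made-up toy graph, vertices of degree below 2 skipped because they have no pairs of neighbors), not the code used to produce the figures.

```python
from collections import defaultdict, deque

def vertex_profile(adj, u):
    """{d: C^d(u)}: neighbor pairs of u at distance d once u is removed."""
    nbrs = list(adj[u])
    n_pairs = len(nbrs) * (len(nbrs) - 1) // 2
    counts = defaultdict(int)
    for i, v in enumerate(nbrs):
        dist, queue = {v: 0}, deque([v])
        while queue:  # breadth-first search that never enters u
            x = queue.popleft()
            for y in adj[x]:
                if y != u and y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
        for w in nbrs[i + 1:]:
            if w in dist:
                counts[dist[w]] += 1
    return {d: c / n_pairs for d, c in counts.items()}

def clustering_profile(adj):
    """{k: {d: C^d_k}}: average of C^d(u) over vertices of degree k >= 2."""
    sums = defaultdict(lambda: defaultdict(float))
    n_k = defaultdict(int)
    # Each vertex is handled independently, so this loop could be
    # split across workers without changing the result.
    for u in adj:
        k = len(adj[u])
        if k < 2:
            continue  # assumption: vertices without neighbor pairs are skipped
        n_k[k] += 1
        for d, c in vertex_profile(adj, u).items():
            sums[k][d] += c
    return {k: {d: s / n_k[k] for d, s in sorted(sums[k].items())}
            for k in sorted(n_k)}

def from_edges(edges):
    """Undirected adjacency sets from an edge list."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)

# Toy graph (hypothetical): triangle 0-1-2 sharing an edge with the 4-cycle 0-2-3-4.
toy_adj = from_edges([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (4, 0)])
toy_profile = clustering_profile(toy_adj)
for k, orders in toy_profile.items():
    print(k, orders)
```

In this toy graph the degree-3 vertices spread their neighbor pairs evenly over orders 1 to 3, while the degree-2 vertices put one third of the weight at order 1 and two thirds at order 2.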

IV. APPLICATIONS 

In order to illustrate the consequences of the clustering
profile, we choose a well-known network: the set of
metabolic networks of bacteria first studied in references
[16, 23], where a growth model is provided to illustrate
their hierarchical organization. Comparing the results
for the real network and this hierarchical model will
help us better understand the phenomena associated with 
this system. We first focus on the profile's absolute value, 
then its variation with distance, and finally its variation 
with degree, as each of these will have distinct implica- 
tions. 



A. Small-worlds 

FIG. 2 (metabolic) shows the clustering profile for the
metabolic network. We note that, based on the usual
coefficient alone (d = 1), one could suggest that more
connected metabolites live in an almost unclustered world.

FIG. 2: The well studied metabolic network, above, has a
richer local topology than one can see with the usual clustering
coefficient, since $C^1_k$ (d = 1) decreases quickly with degree,
while deeper measures (d = 2, 3) reveal the structure present
even in high degrees. The hierarchical model, shown below,
has a much simpler structure and does not reflect the importance
of distance 2 relationships present in the real network.

In contrast, the profile not only shows that more pairs 
of neighbors are related by the second order of clustering 
than the first, but makes it clear that, as degree rises,
clustering only migrates from distance one and two to 
distance three and four. 

In fact, we verify that the sum of $C^d_k$ up to d = 5 is
close to 1.0 for all k, which implies that almost every two 
neighbors of a vertex share with it a cycle of length in 
the order of the network's average distance. 

Now, as can be seen from FIG. 2 (model), the growth
model shares neither of these properties. 

Contrary to the metabolic network, distance 2 does
not play a major role in its local topology. Also, the
model's sum of all orders of the clustering profile quickly
converges to 0 for vertices of high degree, meaning that
close to them it has a very different organization from
that of the real network, being more open or tree-like.



So, although these networks are both considered small-world
networks, in the sense that they have small average
distance while maintaining a significant usual clustering
coefficient, our observations with the clustering profile
set them clearly apart: there are networks where almost
every vertex experiences, in a generalized sense, the small-world
phenomenon, meaning their neighbors are all closely
related, while on other networks this effect is restricted
to a subset, outside of which nodes have groups of neighbors
who, if not for them, would live in distant clusters.

This is not only evidence of very different network- 
growth processes, but we note that this distinction re- 
flects a strikingly different local topology around the 
high degree vertices, also called "hubs" in the literature,
which play a central role in many dynamical processes,
notably in disease spreading [11] and information
retrieval. Therefore, establishing this distinction is an
important step towards a more structured understanding 
of both the growth of, and the dynamics on, small-world 
networks. 

From the arguments above, it is useful to define 
"complete-small-world networks" as those networks with 
small average distance and, for all degrees, the sum of 
their clustering profile up to an order of that average dis- 
tance close to 1. As we noted for the metabolic network, 
this is equivalent to requiring that vertices share short 
cycles with most of their pairs of neighbors, regardless of 
degree. 
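The second half of this definition can be phrased as a simple test on a computed profile. The sketch below is only one possible reading: the text says "close to 1" without fixing a threshold, so the `tolerance` parameter, the rounding of the average distance up to an integer order, and the two example profiles are all assumptions made for illustration. The requirement of a small average distance is not checked here.

```python
import math

def is_complete_small_world(profile, average_distance, tolerance=0.1):
    """Check that, for every degree k, the clustering profile summed up to
    an order of about the average distance is within `tolerance` of 1.

    `profile` maps degree k to {order d: C^d_k}.
    """
    max_order = math.ceil(average_distance)
    return all(
        sum(c for d, c in orders.items() if d <= max_order) >= 1 - tolerance
        for orders in profile.values()
    )

# Made-up profiles, not measured data.
closed_profile = {2: {1: 0.4, 2: 0.5, 3: 0.1}, 10: {2: 0.3, 3: 0.65}}
open_profile = {2: {1: 0.95}, 10: {1: 0.05, 2: 0.10}}

print(is_complete_small_world(closed_profile, average_distance=3.0))  # True
print(is_complete_small_world(open_profile, average_distance=3.0))    # False
```

In the second profile the high-degree vertices account for only 15% of their neighbor pairs within the allowed orders, which is the tree-like situation described above for the hierarchical model.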



B. Hierarchy 

FIG. 3 shows the same data on a log-log scale,
in order to visualize the power-law behavior of clustering
which characterizes hierarchy. We note that, on the
metabolic network, $C^d_k$ for all d varies as a power law
in k over a wide range of degrees, indicating a deeper
hierarchical structure than previously known.
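A power law in k appears as a straight line on a log-log plot, so its exponent can be estimated as the slope of a least-squares line through the logarithms. The sketch below uses synthetic values in place of a measured $C^d_k$; the function name and the data are illustrative assumptions, and a plain least-squares fit is only a rough estimator.

```python
import math

def power_law_exponent(degrees, values):
    """Least-squares slope of log(values) against log(degrees).
    Points with a non-positive degree or value are ignored."""
    points = [(math.log(k), math.log(c))
              for k, c in zip(degrees, values) if k > 0 and c > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx

# Synthetic example: values proportional to 1/k give an exponent of -1.
degrees = [2, 4, 8, 16, 32]
values = [2.0 / k for k in degrees]
exponent = power_law_exponent(degrees, values)
print(round(exponent, 6))
```

Applied order by order, such a fit gives one exponent per d, which is the kind of per-order comparison between network and model made in this section.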

For the model, however, other than the usual clustering
coefficient $C^1_k$, all additional orders differ from the
behavior of the real network: $C^2_k$ and $C^3_k$ can hardly be
considered hierarchical, and all orders greater than 3 are
constantly equal to zero.

While clearly this model was crafted only to illustrate 
the idea of hierarchy, we have been able to spot another 
important feature of the metabolic network's topology, 
its deep hierarchical structure, that is missing from the 
model and would be relevant when studying, for example, 
flow dynamics [22] in bacterial metabolism.



C. Clusterization dynamics 

Finally, we consider the behavior of the profile as a
function of distance, for specific ranges of degrees, in order
to examine the change in influence of growth dynamics
over varying orders of clustering. We choose not to use
the hierarchical model this time because, its profile being
zero beyond order 3, it is of questionable significance to
consider its variation.

[Figure 3: two panels, "metabolic" (orders d = 1 to 5) and "model" (orders d = 1 to 3), plotted against degree k.]

FIG. 3: In a log-log scale, we can see the metabolic network
has a wide range with power-law behavior even for higher
distances. The model, however, lacks this deep hierarchical
structure, and has a zero profile for orders above 3.

Instead, we compare the metabolic network with another
small-world network of similar global characteristics,
the World Wide Web [24]. Both are scale-free,
disassortative [25], hierarchical [26] networks. Our purpose
here is to ask if the evolution from one order to another
might give more clues about their underlying clusterization
dynamics, since these networks are similar in every
other aspect.

However, given that the absolute value of clustering 
depends on degree, there is little reason to suppose its 
variation with order would remain the same. Therefore, 
we split each network into three fractions: small degrees, 
medium degrees and large degrees, based on the behavior 
of the degree variation for the clustering profile on these 
networks. Medium degrees are those where it behaves
well as a power law for various orders; small and high degrees
are those below and above that range, respectively.

We verify that, for low and medium degrees, these networks
show the same behavior, namely an exponential
decrease of the coefficients with increasing order. But,
as we see in FIG. 4, for high degrees the metabolic network
remains exponential, while the WWW behaves as a
power law, revealing a structural change around its most
connected vertices.

FIG. 4: Variation with distance, in log-log scale, of the average
clustering over vertices with high degree. The metabolic
network shows exponential behavior (see inset in semi-log
scale), while the WWW appears as a power law.
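One crude way to tell the two behaviors apart numerically is to compare straight-line fits in the two scalings: an exponential decay is linear in a semi-log plot (log C against d), while a power law is linear in a log-log plot (log C against log d). The sketch below is a heuristic of our own, run on synthetic sequences; it is not the procedure behind FIG. 4, and with few orders available the comparison should be read with caution.

```python
import math

def line_fit_residual(xs, ys):
    """Sum of squared residuals of the least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

def decay_type(orders, values):
    """Label a positive, decreasing sequence by whichever scaling fits better."""
    logs = [math.log(c) for c in values]
    semi_log = line_fit_residual(list(orders), logs)
    log_log = line_fit_residual([math.log(d) for d in orders], logs)
    return "exponential" if semi_log < log_log else "power law"

# Synthetic sequences over orders d = 1..6, not measured data.
orders = [1, 2, 3, 4, 5, 6]
print(decay_type(orders, [math.exp(-d) for d in orders]))  # exponential
print(decay_type(orders, [d ** -2.0 for d in orders]))     # power law
```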

This adds a new approach for evaluating and improving
various network growth models that use community
structure [27], edge addition bias [28, 29] and other strategies
to explain the diverse mechanisms of clustering formation
found in natural, social and technological networks.
In the present case, it suggests a mechanism of
clusterization dynamics more sensitive to degree variation
for the WWW.

We also observe that the first one or two orders of clustering
on these networks do not uniformly follow such
scaling. In the metabolic network the first order is always
lower than extrapolation would suggest and, for high degrees,
so is the second order. In the WWW, however,
that only happens at high degrees, and only to the first
order of clustering, as can be seen in the figure.

It is not clear why such a deviation occurs, but we sup- 
pose it is a consequence of superimposed dynamic rules 
affecting clusterization. 

On the metabolic network, we suggest this effect might 
be related to selective pressure against congestion [22, 30]
of metabolic pathways, since it is always present and af- 
fecting deeper levels of clustering, with the effect getting 
stronger towards highly connected metabolites. 

As for the WWW, this poses one interesting question: 
whether the competing rule is only suppressing the lower 
orders of the profile on high degrees, or whether its inter- 
action with the dynamics is also the cause of the profile's 
deviation from exponential behavior. 

We intend to pursue these questions in a future study
of network growth models. 






V. CONCLUSIONS 

In this paper we reviewed some issues, limitations, 
and alternatives to the clustering coefficient, a measure 
which is a pivotal concept in the field of complex networks.
Through a more comprehensive formulation of this concept,
new insights were given for networks well studied in
the literature, especially when considering network growth
processes. Not limited to that, the new measure presents
us a broader view of usual phenomena related to network
structure dynamics, such as the emergence of clustering,
hierarchy, and the small-world effect. It also seems to be
most consequential for well connected vertices, which are 
central in many networked processes. As a final state- 
ment, we believe that less specialized, richer descriptions 
will allow local topology to play a more significant role 
in understanding the interaction between structure and 
dynamics on networks. 

VI. MATERIAL AND METHODS 

The data used in this paper for the metabolic
network and the WWW are those available on the
website of the CCNR at Univ. of Notre Dame,
http://www.nd.edu/~networks/ . The metabolic network
was reduced according to the procedure in the supplemental
online material of reference [16].

Non-scaled graphics for the profile (Fig. 2) were rebinned,
log-log graphics for the profile (Fig. 3) were log-rebinned,
and the error bars are those from the rebinning
process.
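The rebinning procedure is not detailed here, so the following is only a sketch of one common way to log-rebin data: points are grouped into bins whose edges grow geometrically, and each bin is summarized by its mean and the standard error of that mean. The bin base of 2, the use of the standard error as the error bar, the geometric bin center, and all names are assumptions made for illustration.

```python
import math
from collections import defaultdict

def bin_index(k, base):
    """Index i of the bin [base**i, base**(i + 1)) that contains k >= 1."""
    i = 0
    while base ** (i + 1) <= k:
        i += 1
    return i

def log_rebin(points, base=2.0):
    """Average (k, value) points inside geometrically growing bins.
    Returns a list of (geometric bin center, mean, standard error)."""
    bins = defaultdict(list)
    for k, value in points:
        if k >= 1:
            bins[bin_index(k, base)].append(value)
    rebinned = []
    for i in sorted(bins):
        vals = bins[i]
        n = len(vals)
        mean = sum(vals) / n
        if n > 1:
            variance = sum((v - mean) ** 2 for v in vals) / (n - 1)
            error = math.sqrt(variance / n)
        else:
            error = 0.0  # a single point carries no spread estimate
        rebinned.append((base ** (i + 0.5), mean, error))
    return rebinned

# Made-up (degree, value) points, not data from this paper.
sample = [(1, 1.0), (2, 0.5), (3, 0.3), (4, 0.2), (5, 0.2), (7, 0.2)]
rebinned = log_rebin(sample)
for center, mean, error in rebinned:
    print(round(center, 3), round(mean, 3), round(error, 3))
```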

Higher order clustering coefficients and the clustering
profile were calculated with the graph-tool software,
which is publicly available as Free Software [31] at
http://graph-tool.forked.de/ .